Improved gap size estimation for scaffolding algorithms

Authors

  • Kristoffer Sahlin
  • Nathaniel Street
  • Joakim Lundeberg
  • Lars Arvestad
Abstract

Motivation: One of the important steps of genome assembly is scaffolding, in which contigs are linked using information from read-pairs. Scaffolding provides estimates of the order, relative orientation and distance between contigs. We have found that contig distance estimates are generally strongly biased and based on false assumptions. Since erroneous distance estimates can mislead subsequent analysis, it is important to provide unbiased estimation of contig distance.

Results: In this article, we show that state-of-the-art programs for scaffolding use an incorrect model of gap size estimation. We discuss why current maximum likelihood estimators are biased and describe the different cases of bias that arise. Furthermore, we provide a model for the distribution of reads that span a gap and derive the maximum likelihood equation for the gap length. We motivate why this estimate is sound and show empirically that it outperforms gap estimators in popular scaffolding programs. Our results have consequences for scaffolding software, for structural variation detection and for library insert-size estimation as commonly performed by read aligners.

Availability: A reference implementation is provided at https://github.com/SciLifeLab/gapest.

Supplementary information: Supplementary data are available at Bioinformatics online.
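The kind of bias described in the abstract can be illustrated with a small simulation. The sketch below is not the derivation from the paper and not the GapEst code. It assumes normally distributed fragment lengths, uniformly placed fragment starts, and zero read length. All parameter values and function names (`placements`, `naive_gap`, `ml_gap`) are hypothetical choices made for illustration. It compares a naive estimate (library mean minus the mean observed contig coverage) with a maximum likelihood estimate that accounts for which fragments are able to span the gap.

```python
import math
import random

MU, SIGMA = 3000.0, 500.0   # assumed insert-size mean and standard deviation
C1, C2 = 5000, 5000         # assumed lengths of the two contigs
TRUE_GAP = 2500             # gap used only to generate the simulated data


def placements(o, c1=C1, c2=C2):
    """Count start positions for a fragment covering o contig bases in total."""
    return max(0.0, min(c1, c1 + c2 - o) - max(0.0, c1 - o))


def simulate(n, gap, rng):
    """Sample n observations o = x - gap from fragments that span the gap."""
    obs = []
    while len(obs) < n:
        x = rng.gauss(MU, SIGMA)       # fragment length
        start = rng.uniform(0, C1)     # fragment start, inside contig 1
        end = start + x
        if C1 + gap <= end <= C1 + gap + C2:  # fragment ends inside contig 2
            obs.append(x - gap)
    return obs


def naive_gap(obs):
    """Library mean minus mean observation; ignores the spanning condition."""
    return MU - sum(obs) / len(obs)


def log_pdf(x):
    """Log of the normal density, up to an additive constant."""
    return -0.5 * ((x - MU) / SIGMA) ** 2


def log_norm(gap, step=10):
    """Log of the Riemann-sum mass of fragments able to span the gap."""
    total = sum(math.exp(log_pdf(o + gap)) * placements(o) * step
                for o in range(step, C1 + C2, step))
    return math.log(total)


def ml_gap(obs, candidates):
    """Grid search for the gap that maximises the conditional likelihood."""
    def loglik(gap):
        return sum(log_pdf(o + gap) for o in obs) - len(obs) * log_norm(gap)
    return max(candidates, key=loglik)


obs = simulate(500, TRUE_GAP, random.Random(0))
naive = naive_gap(obs)
ml = ml_gap(obs, range(0, 4001, 10))
print(f"true gap {TRUE_GAP}, naive {naive:.0f}, conditional ML {ml}")
```

Under these assumed parameters, long fragments are over-represented among the pairs that span the gap, so the naive estimate tends to fall below the true gap, while the conditional estimate corrects for that selection effect. A grid search is used here only for simplicity; it is a design choice of this sketch, not a claim about how any scaffolder solves the likelihood equation.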

Similar resources

Iranian English Language Teachers' Perceptions of Monitoring and Scaffolding Practices of Assessment for Learning: A Focus on Gender and Class Size

Recent innovations in formative assessment have turned the spotlight on the implementation of assessment for learning in the classroom. Notwithstanding a considerable wealth of research on assessment for learning in mainstream education, few research studies in the field of language teaching thus far have touched upon assessment for learning. This quantitative study investigated Iranian English...

Close Following Behavior: Estimation of Desired Gap Headway Using Loop Detector Data (TECHNICAL NOTE)

The desired gap headway of drivers, while close following, represents the main parameter in determining the following distance between vehicles. This paper uses raw individual-vehicle data taken from loop detectors for millions of vehicles that used the M25 and M42 in order to estimate the gap headway distributions between successive pairs of vehicles. The data used in this paper were filtered so...

Comparative scaffolding and gap filling of ancient bacterial genomes applied to two ancient Yersinia pestis genomes

Yersinia pestis is the causative agent of the bubonic plague, a disease responsible for several dramatic historical pandemics. Progress in ancient DNA (aDNA) sequencing rendered possible the sequencing of whole genomes of important human pathogens, including the ancient Y. pestis strains responsible for outbreaks of the bubonic plague in London in the 14th century and in Marseille in the 18th c...

Spatiotemporal Estimation of PM2.5 Concentration Using Remotely Sensed Data, Machine Learning, and Optimization Algorithms

PM2.5 (particles <2.5 μm in aerodynamic diameter) can be measured by ground stations in urban areas, but the number of these stations and their geographical coverage are limited. Therefore, these data are not adequate for calculating concentrations of PM2.5 over a large urban area. This study aims to use Aerosol Optical Depth (AOD) satellite images and meteorological data from 2014 to 2017 ...

Meraculous2: fast accurate short-read assembly of large polymorphic genomes

We present Meraculous2, an update to the Meraculous short-read assembler that includes (1) handling of allelic variation using "bubble" structures within the de Bruijn graph, (2) improved gap closing, and (3) an improved scaffolding algorithm that produces more complete assemblies without compromising scaffolding accuracy. The speed and bandwidth efficiency of the new parallel implementation ...

Journal:
  • Bioinformatics

Volume 28, Issue 17

Pages: -

Publication date: 2012